Fix travis - job timeout and test timeout #203

zachyee · 2016-07-07T20:41:23Z

Created more jobs so that no single job takes more than 50 minutes to complete.

Moved the creation of the Docker images for the product tests to before the tests execute rather than during the tests themselves. This helps tests avoid the 10-minute limit on producing no output.

Added some logging that can be used in the future to figure out why tests are taking a long time (if this ever happens).

Works with Presto version 148, but not with the newest version. I can hardcode the URL from which the tests download the Presto RPM so that they fetch version 148 rather than the newest one. This might be useful for testing the release? (This PR fetches 149, not 148, so it will fail on Travis.)

petroav · 2016-07-07T21:11:55Z

.travis.yml

+    - PYTHONPATH=$(pwd)
+  matrix:
+    - OTHER_TESTS=true
+    - PRODUCT_TEST_GROUP_SA_BARE="tests/product/test_installer.py tests/product/standalone/test_installation.py"


What does SA stand for?

EDIT: probably stands for standalone.

It is not clear that SA means that. What about YS? Please do not use abbreviations.

Yea, maybe you can rename the groups to STANDALONE_BARE_TESTS and YARN_SLIDER_TESTS. Including the PRODUCT_TEST_GROUP_* prefix is a little too verbose and redundant.

kokosing · 2016-07-08T05:19:02Z

.travis.yml

+    - PRODUCT_TEST_GROUP=11
+    - PRODUCT_TEST_GROUP=12
+    - PRODUCT_TEST_GROUP=13
+    - PRODUCT_TEST_GROUP=14


Timeouts are not the only reason Travis fails. Travis has limited resources. Some tests spin up a Docker cluster of 4 containers, each of which tries to start Presto. In that case Travis kills some processes because memory usage exceeds its capacity. I think this is the most serious issue.

Moreover, when a process is killed, presto-admin waits until its timeout before reporting that presto-server didn't come up, which makes the tests last even longer.

This was to fix jobs going over 50 minutes and failing. I wasn't considering Travis memory usage as part of this PR.

kokosing · 2016-07-08T05:42:49Z

Travis didn't pass ; /

zachyee · 2016-07-08T16:45:38Z

Haha yea, I noted when I made the PR that Travis still wouldn't work with the newest version of Presto (which this branch uses). I created an experiment branch that uses 148, and those builds have worked.

petroav · 2016-07-08T18:42:49Z

tests/product/image_builder.py

+                             "standalone bare, and yarn-slider presto-admin only images",
+                        action="store_true")
+
+    if len(argv) <= 1:


I think it is best to re-write the argument parsing to use a positional argument. That way you can more succinctly specify that you require at least one argument and you can specify what the available choices are. Here's what I propose:

    import argparse

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        # One or more positional values, each validated against choices.
        parser.add_argument(
            "image_type", metavar="image_type", type=str, nargs="+",
            choices=["standalone_presto", "standalone_pa", "standalone_bare",
                     "yarn_slider_pa", "all"],
            help="Specify the type of image to create. The available "
                 "choices are: standalone_presto, standalone_pa, "
                 "standalone_bare, yarn_slider_pa, all")

        image_builder = ProductTestImageBuilder()
        args = parser.parse_args()
        if "all" in args.image_type:
            # Build every image type.
            image_builder.test_setup_standalone_presto_images()
            image_builder.test_setup_standalone_pa_only_images()
            image_builder.test_setup_standalone_bare_images()
            image_builder.test_setup_yarn_slider_pa_only_images()
        else:
            # Build only the requested image types.
            if "standalone_presto" in args.image_type:
                image_builder.test_setup_standalone_presto_images()
            if "standalone_pa" in args.image_type:
                image_builder.test_setup_standalone_pa_only_images()
            if "standalone_bare" in args.image_type:
                image_builder.test_setup_standalone_bare_images()
            if "yarn_slider_pa" in args.image_type:
                image_builder.test_setup_yarn_slider_pa_only_images()

sure this seems a little cleaner

petroav · 2016-07-08T20:04:38Z

You should squash pretty much all of your commits. At the moment they're broken down at a level that is too granular. I think you should have two commits: one that refactors travis.yml as well as the refactor of BaseProductTest and another commit that introduces the timing decorator.

I would also like @ebd2 to take a look at this when he gets back on Monday.

ebd2 · 2016-07-11T17:59:28Z

.travis.yml

+    - PRODUCT_TEST_GROUP_YS_PA="tests/product/yarn_slider/test_slider_installation.py"
+    #-QUARANTINED="tests/product/yarn_slider/test_server_install.py"
+install:
+  - pip install --upgrade pip==6.1.1


You and @kokosing are going to need to coordinate some on this. See #195

kokosing · 2016-07-15T05:41:16Z

.travis.yml

+    elif [ -v OTHER_TESTS ]; then
+      make clean lint dist docs test-rpm
+      nosetests --with-timer --timer-ok 60s --timer-warning 300s -s tests.unit
+      nosetests --with-timer --timer-ok 60s --timer-warning 300s -s tests.integration


I think there is no need to set timers for the unit and integration tests; they are really short.

kokosing · 2016-07-15T05:46:28Z

20 Travis jobs sounds like overkill. Note that you are running tox -e py26 on machines that requested Python 2.7.

petroav · 2016-07-15T15:32:41Z

tests/product/timing_test_decorator.py

+from time import time
+
+
+def _initialize_module_logger(module_name):


So I suggest we remove this method completely and make this piece of code execute at the module level. In the functions where you refer to the logger variable make sure you include the line global logger. Also, I suggest we hardcode the module to "root". I think that's fine because the reason why you would want to have different loggers for every module is to be able to adjust the logging level and logging destination at a more granular level whereas we just print everything at INFO to the console.

I did some experiments and determined that if the code is moved to be module level it will only get executed once.

If a product test case is taking a long time, this decorator can be added to the test and some of its helper functions to see where and why it is taking a long time.
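The actual contents of timing_test_decorator.py are not shown in this thread, so the following is only a minimal sketch of the idea discussed above: a logger created once at module level on the root logger, and a decorator that logs how long each call takes. The names log_function_time and slow_helper, the INFO configuration, and the message format are all assumptions made for illustration.

```python
import functools
import logging
from time import time

# Module-level setup: runs once, when the module is first imported.
# Using the root logger at INFO follows the suggestion above; the exact
# configuration is an assumption for this sketch.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()


def log_function_time(func):
    """Log the wall-clock duration of every call to func (hypothetical name)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time()
        try:
            return func(*args, **kwargs)
        finally:
            # Reading a module-level name needs no `global` statement;
            # `global` is only required when a function rebinds the name.
            logger.info("%s took %.2f seconds", func.__name__, time() - start)
    return wrapper


@log_function_time
def slow_helper(n):
    # Stand-in for a slow test helper.
    return sum(range(n))
```

Decorating a test method and a few of its helpers this way would show which step dominates the run time, since each decorated call logs its own duration even when it raises.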

zachyee · 2016-07-15T23:31:06Z

@kokosing good point. I think the tests are grouped as optimally as possible, given how long they take right now. I think the only solution is to rework the tests to run faster (like you said). It also doesn't help that the tests have to be run with both Python 2.6 and 2.7, which automatically doubles the number of jobs per build. There's no way to work around this, is there?

@petroav is there a ticket filed somewhere for making the tests run quicker so we can have fewer jobs per build, or should I file one? I also fixed up the logging in timing_test_decorator.py. You can double-check to make sure everything looks good there.

@ebd2 I separated image_builder.py so that it just builds images (_setup_image()) and gave BaseProductTestCase a setup_cluster() function to use. Can you double-check that the logic is correct for both of those functions?

petroav · 2016-07-17T15:46:48Z

@zachyee nah, there's no ticket for speeding up the tests. Feel free to create one but don't work on it. If it takes 20 jobs to bring each run to under 50 minutes, or whatever the timeout is, then so be it. Have you tried with fewer jobs? Things might run faster now with the bigger machines so you should definitely see if you can optimize it.

ebd2 · 2016-07-18T20:12:07Z

tests/product/image_builder.py

+
+
+class ImageBuilder:
+    def __init__(self, caller):


The name caller suggests that it can be something pretty generic. I think we should do the following:

Let this take any testcase (and rename caller -> testcase)

Extract the cluster types into tests/product/cluster_types.py and import them here and in base_product_case.

That separates the two dependencies (testcase for assertions and cluster definitions) such that they can be satisfied independently.
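A rough sketch of the proposed split, under stated assumptions: the cluster-type constants below stand in for whatever tests/product/cluster_types.py would export (the real names and values are not shown in this thread), and check_cluster_type is a hypothetical helper that exists only to show the test case being used purely for its assertions.

```python
import unittest

# Hypothetical stand-ins for constants that would live in
# tests/product/cluster_types.py and be imported by both modules.
STANDALONE_BARE_CLUSTER = "bare"
STANDALONE_PA_CLUSTER = "pa_only_standalone"
CLUSTER_TYPES = (STANDALONE_BARE_CLUSTER, STANDALONE_PA_CLUSTER)


class ImageBuilder(object):
    def __init__(self, testcase):
        # Any unittest.TestCase works; it is needed only for assertions.
        self.testcase = testcase

    def check_cluster_type(self, cluster_type):
        # Hypothetical helper: fail the calling test on an unknown type.
        self.testcase.assertIn(cluster_type, CLUSTER_TYPES)
        return cluster_type


class _AnyTestCase(unittest.TestCase):
    # Minimal test case used only to demonstrate the constructor argument.
    def runTest(self):
        pass


builder = ImageBuilder(_AnyTestCase())
```

With this shape, the builder no longer needs to know about BaseProductTestCase: the assertions come from whichever test case is passed in, and the cluster definitions come from a module that both sides import.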

ebd2 · 2016-07-18T20:22:01Z

Looks good modulo the small refactor and the naming nit.

kokosing · 2016-07-19T05:55:09Z

@zachyee Please make sure that the problem from prestodb/presto#5689 does not exist in your travis.yml script.

Test grouping is ok. Thanks.

Using both tox and Travis to provision different Python versions is redundant. We should choose one of these; I prefer Travis. See #201. You are using tox -e py26, so even though two versions of Python are requested, only py26 is tested.

rschlussel-zz · 2016-07-19T15:19:04Z

.travis.yml

+  matrix:
+    - OTHER_TESTS=true
+    - SHORT_PRODUCT_TEST_GROUP=0
+    - LONG_PRODUCT_TEST_GROUP_STANDALONE_PRESTO_ADMIN="tests/product/test_server_install.py"


I'd get rid of the word standalone in all the test group names. It's the only way to run presto-admin right now, and makes the name really long.
(Sorry for chiming in so late, I've been using this patch to be able to run tests on travis, and noticed this. Also--works great! thanks!)

zachyee · 2016-07-19T15:28:18Z

I combined a couple of job groups to get rid of 4 jobs (from 18 to 14). That's the best I can do until tests are faster or someone reaches out to Travis to increase the 50-minute timeout limit.

Did the cluster_type refactor and renaming.

@kokosing thanks for pointing that out. I ran into a similar problem when the build wasn't erroring out on lint errors. I went ahead and chained the commands together with && so the build fails as soon as any command fails. This works fine for presto-admin because I made it one long if, elif, ... else script, whereas presto needs a bunch of separate scripts with single if statements. If this changes, we can use the fix introduced to presto, but I think this way is more readable. I also removed tox and now invoke nosetests directly. It uses whatever version of Python is installed on the Travis machine, so we can test both 2.6 and 2.7.

ebd2 · 2016-07-19T15:33:19Z

tests/product/standalone/presto_installer.py

-            except:
-                # retry once
-                rpm_name = StandalonePrestoInstaller._download_rpm()
+            raise OSError(1, 'Presto RPM not detected.')


nit: you have a testcase where this gets called in the constructor. Does it make sense to pass it in here and testcase.fail() if the RPM doesn't exist?

It might not, because it isn't a real test case. If it's messier that way, I'm fine with it as-is.
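To make the two options concrete, here is a small sketch, not the code from presto_installer.py. The helper name detect_presto_rpm, its optional testcase parameter, and the file-name pattern are assumptions for illustration; only the OSError call is taken from the diff above.

```python
import fnmatch
import os


def detect_presto_rpm(directory, testcase=None):
    """Return the name of a Presto RPM found in directory (hypothetical helper).

    The file-name pattern is an assumption made for this sketch.
    """
    rpm_names = sorted(
        fnmatch.filter(os.listdir(directory), "presto-server-rpm-*.rpm"))
    if rpm_names:
        return rpm_names[0]
    if testcase is not None:
        # Option raised in the review: report a missing RPM as a test failure.
        testcase.fail("Presto RPM not detected.")
    # Behavior in the diff above: raise when no RPM is present.
    raise OSError(1, "Presto RPM not detected.")
```

Passing a test case turns a missing RPM into an ordinary test failure, while omitting it keeps the exception, which may be the simpler choice when the caller is not a real test case.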

ebd2 · 2016-07-19T15:35:33Z

LGTM!

rschlussel-zz · 2016-07-19T15:41:30Z

I wouldn't worry about decreasing the number of jobs further. It's better for it not to need so many jobs, but having the whole thing complete in a reasonable amount of time is more important.

petroav · 2016-07-19T17:09:41Z

Looks good

petroav reviewed Jul 7, 2016

zachyee force-pushed the fix_travis branch 2 times, most recently from ed95db2 to c11c36d, July 7, 2016 23:32

kokosing reviewed Jul 8, 2016

petroav reviewed Jul 8, 2016

ebd2 reviewed Jul 11, 2016

kokosing reviewed Jul 15, 2016

zachyee force-pushed the fix_travis branch from 8041303 to b10f903, July 15, 2016 12:33

petroav reviewed Jul 15, 2016

Add timing_test_decorator.py (1d4ee79)

If a product test case is taking a long time, this decorator can be added to the test and some of its helper function to see where and why the function is taking a long time.

zachyee force-pushed the fix_travis branch from b10f903 to 71c5619, July 15, 2016 22:33

zachyee force-pushed the fix_travis branch from 71c5619 to 6c676d4, July 18, 2016 14:38

ebd2 reviewed Jul 18, 2016

zachyee force-pushed the fix_travis branch from 6c676d4 to ef359a7, July 18, 2016 22:31

zachyee force-pushed the fix_travis branch from ef359a7 to 75a2f0f, July 19, 2016 15:14

rschlussel-zz reviewed Jul 19, 2016

ebd2 reviewed Jul 19, 2016

zachyee force-pushed the fix_travis branch from 75a2f0f to 8a478d2, July 19, 2016 16:30

zachyee added 2 commits July 19, 2016 13:36

Refactor product tests to build images before test (9557f8d)

To fix travis test timeout issue

Wget presto rpm before running tests (89c6870)

zachyee force-pushed the fix_travis branch from 8a478d2 to 89c6870, July 19, 2016 20:30

zachyee merged commit 89c6870 into master Jul 19, 2016

zachyee deleted the fix_travis branch July 19, 2016 20:36
